Skip to content

feat: Add BijectionConverter and BijectionAttack (#1903)#1942

Open
sajisanchu1913-source wants to merge 16 commits into
microsoft:mainfrom
sajisanchu1913-source:feat/bijection-attack
Open

feat: Add BijectionConverter and BijectionAttack (#1903)#1942
sajisanchu1913-source wants to merge 16 commits into
microsoft:mainfrom
sajisanchu1913-source:feat/bijection-attack

Conversation

@sajisanchu1913-source

Copy link
Copy Markdown
Contributor

Summary

Implements the Bijection Attack from arXiv:2410.01294 (Haize Labs) into PyRIT.

The attack works by teaching a target LLM a secret character mapping through
demonstration shots, then sending harmful prompts encoded in that mapping to
bypass safety filters. Responses are decoded using the inverse mapping.

Changes

New Files

  • pyrit/prompt_converter/bijection_converter.py — generates random letter-to-letter mapping, encodes prompts, decodes responses
  • pyrit/executor/attack/single_turn/bijection_attack.py — runs full bijection attack with teaching phase
  • tests/unit/prompt_converter/test_bijection_converter.py — 11 unit tests for converter
  • tests/unit/executor/test_bijection_attack.py — 5 unit tests for attack
  • doc/code/executor/attack/bijection_attack.ipynb — usage notebook

Modified Files

  • pyrit/prompt_converter/__init__.py — registered BijectionConverter
  • pyrit/executor/attack/single_turn/__init__.py — registered BijectionAttack

How It Works

  1. BijectionConverter generates a random secret mapping (e.g. a→q, b→x...)
  2. BijectionAttack sends teaching messages to target AI to teach the mapping
  3. Harmful prompt is encoded and sent as TASK is '⟪encoded prompt⟫'
  4. Response is decoded using inverse mapping
  5. Decoded response is scored by the judge

Pattern Followed

  • BijectionConverter follows FlipConverter pattern
  • BijectionAttack follows FlipAttack pattern

Reference

sajisanchu1913-source and others added 12 commits May 28, 2026 17:14
- _RemoteDatasetLoader._fetch_zip_from_url:
  - keyword-only args (source, inner_files, cache)
  - streams download (requests stream=True + iter_content) to avoid
    double-buffering large archives
  - md5-keyed disk cache under DB_DATA_PATH / seed-prompt-entries when
    cache=True; named temp file otherwise (cleaned up after parse)
  - validates each inner_files extension against FILE_TYPE_HANDLERS;
    raises ValueError with a member preview if an inner file is missing
  - parses inner files via FILE_TYPE_HANDLERS and returns parsed dicts,
    so the open ZipFile never escapes the worker thread
  - adds the missing import zipfile that broke the previous commit
- _MICDataset:
  - drops unused io / json / requests imports (helper handles them)
  - delegates download + parse to the helper; only owns the seed
    construction loop
  - guards non-string Q values (in addition to NaN moral values)
  - forwards cache from fetch_dataset_async to the helper
  - factors authors into AUTHORS class constant
- Tests:
  - test_moral_integrity_corpus_dataset.py: stops mocking requests.get
    directly; patches _fetch_zip_from_url to return parsed dicts so
    tests don't depend on the helper's internal shape
  - adds test_fetch_dataset_non_string_q and
    test_fetch_dataset_passes_cache_flag
  - hoists imports into the right groups so ruff I001 stops firing
  - removes trailing whitespace / extra newlines
- test_remote_dataset_loader.py: adds TestFetchZipFromUrl covering
  happy path, on-disk caching (hits 1 network call across 2 fetches),
  cache=False does not persist, missing inner file raises ValueError,
  unsupported extension raises ValueError

Verified live against the real MIC.zip: 35,408 unique seeds across
all 6 moral foundations in ~2.4s cold / ~1.3s warm. All 559 dataset
unit tests pass; ruff clean.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Use tempfile.NamedTemporaryFile instead of fixed temp_audio.wav
  to prevent concurrent call collisions
- Wrap Azure upload in try/finally to ensure temp file is always
  deleted even when upload fails
- Add regression test to verify cleanup on upload failure

Fixes microsoft#1894
- Add BijectionConverter that generates random letter-to-letter mapping
- Add BijectionAttack that teaches the mapping to target AI and encodes harmful prompts
- Add unit tests for both converter and attack
- Add notebook demonstrating usage
- Update __init__.py files to register new classes

Based on arXiv:2410.01294 (Haize Labs bijection-learning)

@romanlutz romanlutz left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a great start! There are a few things that need addressing but we're pretty close.

Comment thread pyrit/executor/attack/single_turn/bijection_attack.py Outdated
Comment thread pyrit/prompt_converter/__init__.py Outdated
Comment thread tests/unit/executor/test_bijection_attack.py Outdated
Comment thread pyrit/prompt_converter/bijection_converter.py
Comment thread pyrit/prompt_converter/bijection_converter.py Outdated
Comment thread pyrit/models/data_type_serializer.py Outdated
Comment thread doc/code/executor/attack/bijection_attack.ipynb
Comment thread pyrit/executor/attack/single_turn/bijection_attack.py Outdated
Comment thread pyrit/prompt_converter/bijection_converter.py
Comment thread pyrit/executor/attack/single_turn/bijection_attack.py Outdated
- Remove @pytest.mark.asyncio decorators (asyncio_mode=auto)
- Fix __init__.py alphabetical ordering for BijectionConverter
- Use patch_central_database fixture in attack tests
- Use MagicMock(spec=PromptTarget) instead of plain MagicMock
- Remove dead num_digits parameter
- Add BijectionType StrEnum for bijection_type validation
- Use private attributes with underscore prefix
- Add _build_identifier() method
- Fix teaching shots cap with programmatic cycling
- Fix alternating user/assistant roles in teaching messages
- Fix response decoding in _perform_async
- Add BijectionConverter to _request_converters pipeline
- Fix notebook format and add paired .py jupytext file
- Register BijectionAttack in executor/attack/__init__.py
@sajisanchu1913-source

Copy link
Copy Markdown
Contributor Author

Hi @romanlutz I've addressed all the review comments:

  • Removed @pytest.mark.asyncio decorators
  • Fixed init.py alphabetical ordering
  • Used patch_central_database fixture in attack tests
  • Used MagicMock(spec=PromptTarget) instead of plain MagicMock
  • Removed dead num_digits parameter
  • Added BijectionType StrEnum for validation
  • Used private attributes with underscore prefix
  • Added _build_identifier() method
  • Fixed teaching shots cap with programmatic cycling
  • Fixed alternating user/assistant roles in teaching messages
  • Fixed response decoding in _perform_async
  • Added BijectionConverter to _request_converters pipeline
  • Fixed notebook format and added paired .py jupytext file

Ready for re-review!

@sajisanchu1913-source

Copy link
Copy Markdown
Contributor Author

Hi @romanlutz I've addressed the remaining review comments:

  • Resolved merge conflicts with upstream/main (kept BidiConverter from main, added BijectionConverter alphabetically)
  • Added end-to-end test in TestBijectionAttackEndToEnd that uses MockPromptTarget, returns a cipher-text response, and asserts the result is decoded back to plain text
  • Fixed ComponentIdentifier import to use pyrit.models.identifiers

Ready for re-review

prompt_normalizer: Optional[PromptNormalizer] = None,
max_attempts_on_failure: int = 0,
num_teaching_shots: int = 5,
bijection_type: str = "letter",

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two minor cleanups:

  1. bijection_type is typed as str on the attack but BijectionType on the converter. Line 42 should be bijection_type: BijectionType = BijectionType.LETTER so the public‑facing attack matches the converter's signature. The StrEnum still accepts the literal "letter" at runtime, but the type annotation lies as written.

  2. Optional[X] instead of X | None. Lines 37–39 use Optional[AttackConverterConfig], Optional[AttackScoringConfig], Optional[PromptNormalizer]. The codebase enforces PEP 604 (X | None) via ruff UP007/UP045 — pre‑commit will catch these. While you're at it, line 6 can drop Optional from the typing import.

# decode the response if there is one
if result.last_response and result.last_response.original_value:
decoded = self._bijection_converter.decode(result.last_response.original_value)
result.last_response.original_value = decoded

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking — mutating result.last_response.original_value corrupts the audit trail.

result.last_response.original_value = decoded
result.last_response.original_value = decoded

last_response is a reference to a Message that has already been written to CentralMemory by PromptSendingAttack._perform_async. This in‑place mutation overwrites the recorded target response so memory now shows the decoded plain‑English text as if the target had returned it directly — the actual cipher‑text response from the model is lost from the audit log.

For an attack whose entire purpose is to produce harmful content in obfuscated form, losing the real model output is a significant integrity problem: future runs can't be replayed, the cipher‑shape (which is the evidence the attack worked) is gone, and any downstream analysis sees only the post‑processed version.

The decoded value should be attached alongside the original, e.g.:

  • Add a converted MessagePiece (preferred — that's what the converter pipeline normally produces, and it's what response converters in the normalizer do automatically).
  • Or store the decoded text in AttackResult metadata (result.metadata["decoded_response"] = decoded) and leave original_value untouched.

Related: this is another argument for letting the converter pipeline handle decoding (via response converters on the normalizer) rather than doing it manually here — the pipeline already preserves the original and adds converted values without mutation.

def test_teaching_messages_contain_secret_code(self, mock_objective_target):
attack = BijectionAttack(objective_target=mock_objective_target)
messages = attack._build_teaching_messages()
assert "secret code" in str(messages[0]).lower()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brittle assertion — "secret code" is a literal string from the intro message that's quite likely to get reworded (e.g., if you switch the mapping turn to a system prompt per the paper, the wording will almost certainly change). The test will then break for a reason unrelated to what it's actually trying to verify.

Better to assert structural properties:

  • the first message has role="user" (or "system" after the fix)
  • the message count matches 1 + 2 * num_teaching_shots (intro + shot pairs)
  • subsequent messages alternate roles
  • shots contain the encoded form of examples[i]

These would catch real regressions (e.g., the alternating‑roles fix being undone) instead of just minor prompt rewording.

messages.append(Message.from_prompt(
prompt=f"{encoded} = {original}. Got it!",
role="assistant"
))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking — assistant teaching turns are plaintext English, not in‑cipher.

This change addresses my earlier comment about alternating roles, but the actual content of the assistant turns defeats the purpose. The paper (§2) specifies "in‑context User‑Assistant shots, with User messages in English and Assistant messages in the corresponding bijection language 'translation'". The current code does the opposite:

# line 92-95: ACK in plaintext English
"Understood! I will use this secret code in our conversation."

# line 109-119: user sends cipher and asks for confirmation,
# assistant replies in a half-cipher/half-English translation echo
f"In our code '{encoded}' means '{original}'. Understood?"   # user
f"{encoded} = {original}. Got it!"                            # assistant

The mechanism that makes the attack work is the assistant fluently producing cipher output — that's what induces the cipher‑shaped response distribution at inference time. Plain‑English ACKs plus cipher = plain translation echoes look to the model like "the user is showing me a translation key," not "I should produce text in this language."

Per the paper, the shot pattern should be:

  • User (English): "the quick brown fox"
  • Assistant (cipher): "ekt cvpjl mryio gyx"

And the ACK turn isn't needed at all — the paper just uses 10 translation shots, no separate acknowledgment.

LETTER = "letter"


class BijectionConverter(PromptConverter):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Restructure recommendation: abstract BijectionConverter + 3 concrete subclasses, attack takes a converter instance.

After rereading the paper, the current single-class + BijectionType StrEnum design doesn't scale to what the paper actually requires. §2 specifies three bijection types — permuted alphabet, ℓ‑digit numbers, and tokens from the target's tokenizer — and explicitly notes their complexity parameters (fixed_size, ℓ, vocab subset) are what give the attack its scale‑adaptive property. So implementing only LETTER understates what the attack claims to do.

Stuffing per‑mode params (num_digits for digits, tokenizer for tokens) onto a single class produces dead‑param footguns (BijectionConverter(bijection_type=DIGITS, fixed_size=5) would silently mix modes). Subclasses give honest signatures:

class BijectionConverter(PromptConverter, abc.ABC):
    def __init__(self, *, mapping: dict[str, str] | None = None, seed: int | None = None) -> None:
        rng = random.Random(seed)
        self._mapping = mapping if mapping is not None else self._generate_mapping(rng)
        self._inverse_mapping = {v: k for k, v in self._mapping.items()}

    @abc.abstractmethod
    def _generate_mapping(self, rng: random.Random) -> dict[str, str]: ...

    async def convert_async(...): ...   # shared
    def decode(...): ...                # shared
    def _build_identifier(...): ...     # shared

class LetterBijectionConverter(BijectionConverter):
    def __init__(self, *, fixed_size: int = 0, mapping=None, seed=None): ...

class DigitBijectionConverter(BijectionConverter):
    def __init__(self, *, num_digits: int = 2, mapping=None, seed=None): ...

class TokenBijectionConverter(BijectionConverter):
    def __init__(self, *, tokenizer, mapping=None, seed=None): ...

The base class gets two things that are needed now, not just for future modes:

  • seed — currently random.shuffle uses the global RNG with no way to reproduce a mapping. Red‑team work needs replay; a seed parameter (constructing a local random.Random(seed)) is the standard fix.
  • mapping — accept an explicit dict[str, str] so callers can replay a known successful mapping or run deterministic experiments. The test file currently works around the lack of this by reading converter.mapping after random generation, which is awkward.

BijectionAttack then simplifies dramatically — it just accepts a converter instance:

class BijectionAttack(PromptSendingAttack):
    def __init__(
        self,
        *,
        objective_target: PromptTarget = REQUIRED_VALUE,
        bijection_converter: BijectionConverter = REQUIRED_VALUE,  # this could also be None and have the letter version as default
        num_teaching_shots: int = 10,
        ...
    ): ...

This drops bijection_type, fixed_size, and the type‑confused bijection_type: str = "letter" annotation. The user composes:

attack = BijectionAttack(
    objective_target=target,
    bijection_converter=DigitBijectionConverter(num_digits=2, seed=42),
)

I'd push for all three modes in this PR — landing only LETTER and adding the rest later means either a breaking API change (when the inevitable bijection_type / per‑mode params get reshuffled) or sticking with the current sub‑optimal design. The architectural cost is paid once now; the alternative is paying it twice.

Acknowledging this is a bigger ask than my earlier comments — happy to discuss if you'd prefer to land LETTER only and follow up, but I think the restructure is worth it.

- Change Optional[X] to X | None (PEP 604)
- Change bijection_type: str to BijectionType in attack
- Register BijectionType in prompt_converter __init__.py
- Store decoded response in metadata instead of mutating last_response
- Fix teaching shots: user sends English, assistant responds in cipher
- Fix brittle test assertions to check structural properties
- Update end-to-end test to check metadata for decoded response
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

FEAT Bijection

2 participants